My public GitHub repository for the final assignment can be found here.
In the United Kingdom, police powers to stop and search members of the public are currently regulated by various pieces of legislation. However, a growing body of research shows that bias still exists in who experiences stop and search by the police. Most of these studies focus on the identity of the people targeted, which is also covered in my research.
Extending previous studies that focused on who tends to be stopped and searched by police officers, I study where stop and search (S&S) concentrates and investigate the effect of economic inequality. In addition, because the UK Police Data provide a wealth of textual information on police priorities, I pay extra attention to how the reporting of potential criminal issues influences stop and search by the local police force.
I chose London as my study area, using the latest S&S data, from December 2023. The UK has many administrative and census boundaries, such as LSOA (Lower Layer Super Output Area) and LAD (Local Authority District); their spatial data and lookup tables are available from the UK Office for National Statistics. Local Authority Districts are a level of local government subordinate to larger administrative divisions such as counties or regions. After some experimentation, I chose to aggregate the S&S data to LAD level. The study area contains 50 LADs in total, which is a suitable number of units for the analytic method I chose.
My data come mainly from two sources. From data.police.uk, I used the official API to retrieve stop-and-search data (hereafter S&S) and all traceable neighborhood priority texts for the study area over the study period. From the Open Geography Portal (statistics.gov.uk), I downloaded the 2021 UK Census data and Index of Multiple Deprivation (IMD) data, as well as some of London’s boundary data, and loaded them into R as CSV and shapefile data, converting them to tibbles or data frames where analysis and calculation required it. The final dataset mainly comprises data frames, tibbles, simple features (sf), and a document-feature matrix (dfm).
Gathering S&S Data from API
S&S Data Preprocessing
I gathered S&S data for London for December 2023 from the UK Police API. After cleaning, the tidy table contains the date, location, personal identity and details of each S&S.
## # A tibble: 6,906 × 18
## datetime gender age_range self_defined_ethnicity
## <dttm> <chr> <chr> <chr>
## 1 2023-11-02 10:31:00 Male 25-34 Black/African/Caribbean/Black British -…
## 2 2023-11-02 10:31:00 Female 25-34 Black/African/Caribbean/Black British -…
## 3 2023-11-03 14:30:00 Male 18-24 Asian/Asian British - Indian
## 4 2023-11-03 22:30:00 Male 18-24 White - English/Welsh/Scottish/Northern…
## 5 2023-11-03 22:30:00 Male over 34 White - English/Welsh/Scottish/Northern…
## 6 2023-11-05 12:32:00 Male 18-24 Asian/Asian British - Pakistani
## 7 2023-11-05 12:32:00 Male 25-34 Asian/Asian British - Pakistani
## 8 2023-11-05 12:50:00 Male 18-24 Asian/Asian British - Any other Asian b…
## 9 2023-11-06 09:07:00 Male over 34 White - Any other White background
## 10 2023-11-07 11:00:00 Male 25-34 Black/African/Caribbean/Black British -…
## # ℹ 6,896 more rows
## # ℹ 14 more variables: officer_defined_ethnicity <chr>, longitude <chr>,
## # latitude <chr>, object_of_search <chr>, outcome <chr>,
## # outcome_linked_to_object_of_search <chr>,
## # removal_of_more_than_outer_clothing <chr>, street_id <chr>,
## # street_name <chr>, force <chr>, PFA_name <chr>, type <chr>,
## # involved_person <chr>, legislation <chr>
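The retrieval step described above could be sketched roughly as follows. This is a minimal sketch, not the code used for this report: it assumes the `stops-force` endpoint of the data.police.uk API, the force identifier `metropolitan`, and the jsonlite package for parsing. The helper name `build_stops_url` is hypothetical.

```r
# Hypothetical helper: build the request URL for one force and one month.
build_stops_url <- function(force, month) {
  stopifnot(grepl("^[0-9]{4}-[0-9]{2}$", month))  # month must look like "2023-12"
  sprintf("https://data.police.uk/api/stops-force?force=%s&date=%s", force, month)
}

url <- build_stops_url("metropolitan", "2023-12")
print(url)

# Fetch only when run interactively and jsonlite is installed (assumption:
# the endpoint returns a JSON array that flattens into a data frame).
if (interactive() && requireNamespace("jsonlite", quietly = TRUE)) {
  stops_raw <- jsonlite::fromJSON(url, flatten = TRUE)
  str(stops_raw, max.level = 1)
}
```

Keeping the URL construction separate from the download makes it easy to loop over several forces or months before binding the results into one table.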
The side-by-side stacked bar charts show the composition of both the objects of search and the outcomes. The object of search reflects the type of crime the person is suspected of, which is discussed further later in the research. The outcome shows how each search was resolved; a "no further action" disposal suggests that the police may have misjudged the situation or searched without sufficient grounds, possibly because of prejudice.
Therefore, I focus on the misjudged S&S records, which may help uncover hidden bias after excluding certain groups of people with high crime rates.
Calculating Proportion of different Identities
Plot Pie Chart and Stacked Bar Chart
The pie charts show the proportions of people with different identities among those stopped and searched. Males are far more likely to be subject to S&S than females. Each age group above 10 accounts for a similar share of S&S, with those over 34 having the largest share. According to officer-defined ethnicity, White people account for the largest proportion, followed by Black and Asian people, although this may simply be because White people are the largest group in the UK population.
Similar but more detailed information can be found in the chart of Self-defined Ethnicity of People under S&S, a stacked bar chart drawn with facet_wrap() and split into 5 ethnic groups.
Given the characteristics that suspects are most likely to have and the composition of the local population, these statistical results are in line with expectations.
To reduce the influence of demographics on the identity structure observed in S&S, I collected the 2021 UK Census data and narrowed it to London by restricting the LAD codes to those in the study area. Because the datasets use different statistical geographies, I also built a lookup table to switch between boundary IDs.
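Filtering by LAD code and switching between boundary IDs can be illustrated with a plain merge. This is a minimal sketch under stated assumptions: the lookup is assumed to have columns `LSOA21CD` and `LAD22CD`, and all codes, counts and object names below are made up for illustration.

```r
# Toy lookup between two boundary levels (made-up codes, illustration only).
lookup <- data.frame(
  LSOA21CD = c("L001", "L002", "L003", "L004"),
  LAD22CD  = c("A01", "A01", "A02", "B01")
)

# Toy census counts at the finer (LSOA) level.
census_lsoa <- data.frame(
  LSOA21CD   = c("L001", "L002", "L003", "L004"),
  population = c(1500, 1700, 1600, 1400)
)

# Hypothetical list of LAD codes that make up the study area.
study_lads <- c("A01", "A02")

# Attach the LAD code to every LSOA, keep the study area, then sum per LAD.
census_joined <- merge(census_lsoa, lookup, by = "LSOA21CD")
census_study  <- census_joined[census_joined$LAD22CD %in% study_lads, ]
census_lad    <- aggregate(population ~ LAD22CD, data = census_study, FUN = sum)
print(census_lad)
```

The same pattern works in either direction: any table keyed on one boundary ID can be re-keyed on another once the lookup has been merged in.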
I summarised the age range, gender and ethnicity data from the Census according to the classification used in the S&S data, as shown below.
## # A tibble: 3 × 7
## LAD22CD age_total under_10_prop `10-17_prop` `18-24_prop` `25-34_prop`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E06000034 172378 0.137 0.130 0.0543 0.141
## 2 E06000039 151940 0.151 0.144 0.0564 0.144
## 3 E06000060 529692 0.116 0.121 0.0457 0.112
## # ℹ 1 more variable: over_34_prop <dbl>
## # A tibble: 3 × 7
## LAD22CD ethnic_total Asian_prop Black_prop White_prop Mixed_prop Other_prop
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 E06000034 172407 0.0646 0.109 0.784 0.0283 0.0146
## 2 E06000039 151948 0.454 0.0752 0.375 0.0405 0.0558
## 3 E06000060 529697 0.118 0.0237 0.810 0.0337 0.0149
## # A tibble: 3 × 4
## LAD22CD sex_total Male_prop Female_prop
## <chr> <int> <dbl> <dbl>
## 1 E06000034 172403 0.489 0.511
## 2 E06000039 151960 0.494 0.506
## 3 E06000060 529713 0.489 0.511
Disproportionality describes situations in which a particular group is over-represented or under-represented in police S&S practices compared with its share of the general population. A simple measure is (proportion of the group in S&S / proportion of the group in the total population) - 1. A value greater than 0 means the group is over-represented in S&S, and a value below 0 means it is under-represented.
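The calculation can be written as a small function. This is a minimal sketch: the function name `disproportionality` is hypothetical and the shares below are made up for illustration, not results from this study.

```r
# Disproportionality index: share in S&S divided by share in population, minus 1.
disproportionality <- function(ss_prop, pop_prop) {
  stopifnot(length(ss_prop) == length(pop_prop), all(pop_prop > 0))
  ss_prop / pop_prop - 1
}

# Made-up shares for two groups (illustrative only).
ss_prop  <- c(group_a = 0.60, group_b = 0.40)
pop_prop <- c(group_a = 0.50, group_b = 0.50)

index <- disproportionality(ss_prop, pop_prop)
print(round(index, 2))
```

In this toy case group_a makes up 60% of searches but 50% of the population, so its index is positive (over-represented), while group_b's index is negative.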
The following charts show the disproportionality for sex, age and ethnic groups. They indicate that males, young people aged 18-24, the Black group and the Other group (which covers identities that are rare or too complex to classify) are over-represented. The finding for ethnic groups is notable because it points to potential racial prejudice.
Spatially matching the data to each LAD
Calculate the S&S rate in each LAD
Calculate the correlation coefficient
The discussion above found that police S&S is biased against certain groups. Does the identity structure of a region then affect its rate of misjudged S&S?
Here I aggregated the Census data and S&S data by LAD ID matching and spatial matching to obtain the summary table London_LAD_sf. I then selected the proportions of the gender, age, and ethnicity variables in the total population of each area and correlated them with misjudged stop-and-search incidents per 10,000 people.
## Simple feature collection with 3 features and 22 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -0.3833711 ymin: 51.32638 xmax: -0.2172895 ymax: 51.41204
## Geodetic CRS: WGS 84
## # A tibble: 3 × 23
## LAD21CD LAD21NM geometry SS_count ethnic_total Asian_prop
## <chr> <chr> <MULTIPOLYGON [°]> <dbl> <int> <dbl>
## 1 E07000207 Elmbridge (((-0.3190702 51.39291, … 0 137764 0.0655
## 2 E07000208 Epsom an… (((-0.2209677 51.32986, … 0 77785 0.111
## 3 E07000210 Mole Val… (((-0.3066239 51.33492, … 0 87370 0.0295
## # ℹ 17 more variables: Black_prop <dbl>, White_prop <dbl>, Mixed_prop <dbl>,
## # Other_prop <dbl>, age_total <dbl>, under_10_prop <dbl>, `10-17_prop` <dbl>,
## # `18-24_prop` <dbl>, `25-34_prop` <dbl>, over_34_prop <dbl>,
## # sex_total <int>, Male_prop <dbl>, Female_prop <dbl>, Pop_density <dbl>,
## # area <dbl>, Population <dbl>, S_S_prop <dbl>
## [1] "LAD21CD" "LAD21NM" "geometry" "SS_count"
## [5] "ethnic_total" "Asian_prop" "Black_prop" "White_prop"
## [9] "Mixed_prop" "Other_prop" "age_total" "under_10_prop"
## [13] "10-17_prop" "18-24_prop" "25-34_prop" "over_34_prop"
## [17] "sex_total" "Male_prop" "Female_prop" "Pop_density"
## [21] "area" "Population" "S_S_prop"
The following graph shows the correlations between the 12 variables, with blue indicating a positive relationship and red a negative one. Age structure is strongly related to the S&S rate: the lower the proportion of people aged under 17 and the higher the proportion aged 18 to 34, the higher the local S&S rate. This makes sense because drug offences, violence, theft and similar offences are more likely to involve young adults. Among the ethnic groups, only a higher proportion of the White group is associated with a lower local S&S rate; higher proportions of all other groups are associated with a somewhat higher overall S&S rate.
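The rate and correlation steps described above can be sketched in base R. This is an illustration only: the data are randomly generated, the column name `S_S_rate` is hypothetical, and the choice of Pearson correlation is an assumption rather than something stated in the report.

```r
# Synthetic LAD-level table (randomly generated values, not study data).
set.seed(1)
n <- 50
lad <- data.frame(
  SS_count   = rpois(n, lambda = 60),
  Population = round(runif(n, 8e4, 3.5e5)),
  Male_prop  = runif(n, 0.47, 0.52),
  White_prop = runif(n, 0.30, 0.90)
)

# Rate of S&S per 10,000 residents.
lad$S_S_rate <- lad$SS_count / lad$Population * 10000

# Pearson correlation matrix between the rate and the demographic shares.
vars <- c("S_S_rate", "Male_prop", "White_prop")
cor_mat <- cor(lad[, vars], method = "pearson")
print(round(cor_mat, 2))
```

Normalising counts by population before correlating keeps large districts from dominating the result simply because more people live there.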
Plot London basemap using stadiamaps API
Plot Stop-and-Search Heatmap
The heatmap shows that stop and search is spatially concentrated in central London, with the north-eastern part of the centre having the highest S&S density.
The Indices of Multiple Deprivation (IMD) 2019 provide a set of relative measures of deprivation for small areas (Lower-layer Super Output Areas, LSOAs) across England, based on seven domains of deprivation: Income, Employment, Education, Health, Crime, Barriers to Housing and Services, and Living Environment. According to the related literature, some social disorganization indicators, which are intrinsically related to economic inequality, also predict variation in police behavior.
I therefore downloaded the IMD indices and added them as a complement to the existing identity variables to better understand police S&S practices. Shown below is the final tibble with all of the variables involved in this research.
## # A tibble: 50 × 30
## LAD21CD LAD21NM geometry SS_count ethnic_total Asian_prop
## <chr> <chr> <MULTIPOLYGON [°]> <dbl> <int> <dbl>
## 1 E07000207 Elmbrid… (((-0.3190702 51.39291, … 0 137764 0.0655
## 2 E07000208 Epsom a… (((-0.2209677 51.32986, … 0 77785 0.111
## 3 E07000210 Mole Va… (((-0.3066239 51.33492, … 0 87370 0.0295
## 4 E09000021 Kingsto… (((-0.2450542 51.38004, … 102 167046 0.178
## 5 E09000024 Merton (((-0.1892628 51.43827, … 82 212564 0.183
## 6 E09000027 Richmon… (((-0.3286939 51.39231, … 40 195306 0.0886
## 7 E09000032 Wandswo… (((-0.1270813 51.48358, … 184 317106 0.115
## 8 E09000029 Sutton (((-0.1565688 51.32151, … 54 206637 0.173
## 9 E07000211 Reigate… (((-0.1243196 51.28676, … 0 144655 0.0714
## 10 E07000215 Tandrid… (((0.04236905 51.29267, … 0 86561 0.0351
## # ℹ 40 more rows
## # ℹ 24 more variables: Black_prop <dbl>, White_prop <dbl>, Mixed_prop <dbl>,
## # Other_prop <dbl>, age_total <dbl>, under_10_prop <dbl>, `10-17_prop` <dbl>,
## # `18-24_prop` <dbl>, `25-34_prop` <dbl>, over_34_prop <dbl>,
## # sex_total <int>, Male_prop <dbl>, Female_prop <dbl>, Pop_density <dbl>,
## # area <dbl>, Population <dbl>, S_S_prop <dbl>, Income_IMD <dbl>,
## # Employ_IMD <dbl>, Education_IMD <dbl>, Health_IMD <dbl>, Crime_IMD <dbl>, …
## [1] "LAD21CD" "LAD21NM" "geometry" "SS_count"
## [5] "ethnic_total" "Asian_prop" "Black_prop" "White_prop"
## [9] "Mixed_prop" "Other_prop" "age_total" "under_10_prop"
## [13] "10-17_prop" "18-24_prop" "25-34_prop" "over_34_prop"
## [17] "sex_total" "Male_prop" "Female_prop" "Pop_density"
## [21] "area" "Population" "S_S_prop" "Income_IMD"
## [25] "Employ_IMD" "Education_IMD" "Health_IMD" "Crime_IMD"
## [29] "Housing_IMD" "Living_IMD"
The following map illustrates the spatial distribution of inequality in London, showing marked spatial heterogeneity in every domain of deprivation.
Compare Model Fitting
Compare Regression Coefficients
Here I compare the results of two regression models: one using identity features only, and one using identity features together with the spatial inequality indices.
The second model improves on the first in fit and explanatory power: multiple R-squared rises from 0.8797 to 0.9326 and adjusted R-squared from 0.8526 to 0.8989. It also shows the detailed impact of individual IMD domains on the S&S rate. The improvement is probably due to the second model including more explanatory variables, which help capture other important factors influencing the dependent variable.
##
## Call:
## lm(formula = S_S_prop ~ ., data = selected_vars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1506 -1.1287 0.0657 1.4424 6.8746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -150.66 35.84 -4.204 0.000143 ***
## Asian_prop -35.93 23.57 -1.524 0.135274
## Black_prop -23.42 24.26 -0.966 0.339990
## White_prop -26.79 20.37 -1.315 0.195917
## Mixed_prop 87.78 70.01 1.254 0.217183
## under_10_prop -48.53 79.89 -0.608 0.546944
## `10-17_prop` -78.12 93.80 -0.833 0.409875
## `18-24_prop` 174.75 67.06 2.606 0.012815 *
## `25-34_prop` -21.09 27.03 -0.780 0.439759
## Male_prop 377.62 55.36 6.821 3.33e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.873 on 40 degrees of freedom
## Multiple R-squared: 0.8797, Adjusted R-squared: 0.8526
## F-statistic: 32.49 on 9 and 40 DF, p-value: 1.178e-15
##
## Call:
## lm(formula = S_S_prop ~ ., data = selected_add_inequality_vars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1010 -1.5036 0.3633 1.6091 5.0577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.162e+02 4.905e+01 -4.407 0.00011 ***
## Asian_prop 1.817e+01 2.592e+01 0.701 0.48838
## Black_prop 1.242e+01 2.324e+01 0.535 0.59659
## White_prop 2.044e+01 2.333e+01 0.876 0.38742
## Mixed_prop 1.021e+02 7.789e+01 1.311 0.19933
## under_10_prop -2.959e+01 7.715e+01 -0.384 0.70388
## `10-17_prop` 3.604e+01 1.059e+02 0.340 0.73576
## `18-24_prop` 6.716e+01 6.533e+01 1.028 0.31167
## `25-34_prop` 5.371e+01 4.647e+01 1.156 0.25624
## Male_prop 3.296e+02 6.724e+01 4.902 2.64e-05 ***
## Income_IMD -1.803e+02 1.093e+02 -1.649 0.10896
## Employ_IMD 4.913e+02 1.831e+02 2.684 0.01144 *
## Education_IMD 1.971e-02 1.509e-01 0.131 0.89686
## Health_IMD -6.167e+00 2.472e+00 -2.495 0.01795 *
## Crime_IMD -6.483e+00 2.158e+00 -3.005 0.00513 **
## Housing_IMD -1.511e-04 1.070e-01 -0.001 0.99888
## Living_IMD 1.990e-01 9.905e-02 2.009 0.05305 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.395 on 32 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.9326, Adjusted R-squared: 0.8989
## F-statistic: 27.69 on 16 and 32 DF, p-value: 2.793e-14
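A comparison of two nested linear models like the ones summarised above might be sketched as follows. The data here are synthetic and only two predictors are used; adjusted R-squared, AIC and the nested F-test are offered as illustrative comparison criteria, not necessarily the ones used in this report.

```r
# Synthetic data standing in for the LAD table (not the study data).
set.seed(42)
n <- 50
d <- data.frame(
  Male_prop  = rnorm(n),
  Employ_IMD = rnorm(n)
)
d$S_S_prop <- 2 * d$Male_prop + 1.5 * d$Employ_IMD + rnorm(n)

# Model 1: identity variable only. Model 2: adds a deprivation variable.
m1 <- lm(S_S_prop ~ Male_prop, data = d)
m2 <- lm(S_S_prop ~ Male_prop + Employ_IMD, data = d)

# Compare fit: adjusted R-squared, AIC, and an F-test for the nested models.
fit_table <- data.frame(
  model  = c("m1", "m2"),
  adj_r2 = c(summary(m1)$adj.r.squared, summary(m2)$adj.r.squared),
  AIC    = c(AIC(m1), AIC(m2))
)
print(fit_table)
print(anova(m1, m2))
```

Note that `anova()` requires both models to be fitted on the same rows. Because the second model in the output above dropped one observation with a missing value, the first model would need to be refitted on the complete cases before such a test.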
The figure below compares the regression coefficients of each variable in the two models. In model 2, the new variables Employ_IMD and Income_IMD take over part of the explanatory role that Male_prop played in model 1.
Geographically Weighted Regression (GWR) is a spatial analysis technique that allows regression coefficients to vary across space. This differs from OLS linear regression, which assumes that relationships are constant across the entire study area. I therefore use GWR to investigate local differences in the influence of these factors.
The modeling steps are as follows.
library(spgwr)  # provides gwr.sel() and gwr()

# Define the model formula (column names as they appear in London_LAD_spatial)
formula <- S_S_prop ~ Asian_prop + Black_prop + White_prop + Mixed_prop +
  under_10_prop + X10.17_prop + X18.24_prop + X25.34_prop +
  Male_prop +
  Income_IMD + Employ_IMD + Education_IMD + Health_IMD + Crime_IMD +
  Housing_IMD + Living_IMD

# Automatic bandwidth selection by minimising AIC
bw <- gwr.sel(formula, data = London_LAD_spatial, method = "aic")

# Fit the GWR model with the selected bandwidth
gwr_model <- gwr(formula, data = London_LAD_spatial, bandwidth = bw, hatmatrix = TRUE)
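To make the idea behind the model-fitting step more concrete, the sketch below fits one distance-weighted regression per location in base R. It is a simplified illustration of the GWR principle on synthetic data, not a replacement for the package call above: the Gaussian kernel form, the fixed bandwidth of 0.2 and the helper name `local_coefs` are all assumptions.

```r
# Synthetic points whose slope drifts from west to east (not study data).
set.seed(7)
n <- 60
coords <- cbind(x = runif(n), y = runif(n))
x1 <- rnorm(n)
y  <- 1 + (1 + 2 * coords[, "x"]) * x1 + rnorm(n, sd = 0.1)
d  <- data.frame(y = y, x1 = x1)

# Fit one weighted regression per location using Gaussian distance weights.
local_coefs <- function(d, coords, bandwidth) {
  t(sapply(seq_len(nrow(d)), function(i) {
    dist2 <- (coords[, 1] - coords[i, 1])^2 + (coords[, 2] - coords[i, 2])^2
    w <- exp(-0.5 * dist2 / bandwidth^2)  # nearby points get larger weights
    coef(lm(y ~ x1, data = d, weights = w))
  }))
}

coefs <- local_coefs(d, coords, bandwidth = 0.2)

# Local slopes should rise with the x coordinate in this synthetic setup.
print(cor(coefs[, "x1"], coords[, "x"]))
```

Each row of `coefs` holds the intercept and slope estimated around one location, which is the kind of per-location coefficient surface that gets mapped in a GWR analysis.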
I use tmap to draw an interactive map with one layer for each variable’s coefficients. Different categories of variables are shown in different palettes. The sizes of the yellow circles represent the S&S rate.
Interestingly, the coefficients of the ethnicity variables share a similar spatial pattern, with their influence increasing from west to east, while the coefficients of the age variables show the opposite pattern. The coefficients of the Male group and of the group aged 18-24, which are the largest, decrease from south to north. For the IMD indices, the magnitudes and directions of variation differ greatly between domains, so the domains complement each other in measuring London’s spatial inequality.